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Abstract 

Dictionary learning for sparse representations is traditionally approached with 
sequential atom updates, in which an optimized atom is used immediately for the 
optimization of the next atoms. We propose instead a Jacobi version, in which 
groups of atoms are updated independently, in parallel. Extensive numerical 
evidence for sparse image representation shows that the parallel algorithms, 
especially when all atoms are updated simultaneously, give better dictionaries 
than their sequential counterparts. 
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1. Introduction 

The sparse representations field is the basis for a wide range of very effective 
signal processing techniques with numerous applications for, but not limited to, 
audio and image processing. 

In this article, we approach the problem of training dictionaries for sparse 
representations by learning from a representative data set. Given a set of signals 
Y G and a sparsity level s, the goal is to find a dictionary D G that 

minimizes the Frobenius norm of the approximation error 

E = Y-DX, (1) 

where X G is the associated s-sparse representations matrix, with at 

most s nonzero elements on each column. Otherwise said, each column (signal 
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or data vector) from Y is represented as a linear combination of at most s 
columns (atoms) from D. To eliminate the magnitude ambiguity in this bilinear 
problem, where both D and X are unknown, the columns of the dictionary are 
constrained to unit norm. 

Since dictionary learning (DL) for sparse representations is a hard problem, 
the most successful algorithms, like K-SVD |T] (and its approximate version 
AK-SVD 12) and MOD |2, adopt an alternating optimization procedure with 
two basic stages. First, fixing the current dictionary D (initialized randomly or 
with a subset of Y) , the sparse representations X are computed with Orthogonal 
Matching Pursuit (OMP) [3] or another algorithm. Then, keeping X fixed, a 
new dictionary is obtained through various techniques. The second stage, where 
the atoms of the dictionary are updated, makes the main difference between DL 
algorithms. Recent methods or improvements can be found in 0, 0, E, ID- 
Some of them will be discussed later, since they are used for supporting our 
method. Among the algorithms related to DL, but with more constraints on 
the dictionary, are m, cni, la- Overviews of earlier work and applications are 
presented in muD. 

All these DL algorithms update the atoms one by one, in Gauss-Seidel style. 
The motivation is the classical one: an updated atom, assumed to be better than 
its previous value, can be used immediately for other updates. We investigate 
here the Jacobi version of several algorithms, where groups of atoms are updated 
simultaneously. We started this work in na, where our study was confined to 
AK-SVD, aiming at reducing the dictionary design time on a GPU architecture. 
However, extensive numerical evidence shows that not only this strategy is not 
worse than the standard sequential approach, but in many circumstances gives 
a smaller representation error 0. This manuscript presents the Jacobi atom 
updates (JAU) strategy in section its particular form for a few of the best 
sequential methods in section and the above mentioned numerical evidence in 
section ID 

As a side remark, the name "parallel atom updates" (PAU) is at least as 
good as JAU to label our approach. Unfortunately, this name was already used 
in |S] although the atoms are updated sequentially there, using several AK- 
SVD update sweeps. An idea similar with PAU is called more appropriately 
"dictionary update cycles" in [S], in the context of K-SVD. 

2. Jacobi Atom Updates Stategy 

The general form of the proposed dictionary learning method with Jacobi 
atom updates is presented in Algorithmic At iteration k of the DL method, the 
two usual stages are performed. In step 1, the current dictionary and the 
signals Y are used to find the sparse representation matrix X^^'^ with s nonzero 
elements on each column; we used OMP, as widely done in the literature. 

The atom update stage takes place in groups of h atoms. We assume that 
n divides n only for the simplicity of description, but this is not a mandatory 
condition. Steps 2 and 3 of Algorithm [C perform a full sweep of the atoms. All 
the n atoms from the same group are updated independently (step 4), using one 
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Algorithm 1: General structure of a DL-JAU iteration 

Data: current dictionary G 

signals set Y € 
number of parallel atoms h 
Result: next dictionary 

1 Compute s-sparse representations G such that Y « 

2 for £ = 1 to n/n do 

3 for j = {i — l)n + 1 to in, in parallel do 

4 Update 

5 Normalize: 


of the various available rules; some of them will be discussed in the next section. 
Once a group is processed, its updated atoms are used for updating the other 
atoms; so, atom (column j of is computed in step 4 using 

if 

L(*-l)/hJ<L(j-l)/nJ, (2) 

(k) 

i.e. i < j and di not in the same group as dj, and d] otherwise. 

Putting n = 1 gives the usual sequential Gauss-Seidel form. Taking n = n 
leads to a fully parallel update, i.e. the form that is typically labeled with 
Jacobi’s name. 

The specific atom update strategy of each algorithm is contained in step 4 
while step 5 is the usual normalization constraint on the dictionary. 

The proposed form has obvious potential for a smaller execution time on a 
parallel architecture. We lightly touch this issue here and provide comparative 
execution times from a few experiments in section the reader can consult m 
for an in-depth analysis of the GPU implementation of the AK-SVD algorithm. 
Our main focus here is on the quality of the designed dictionary. 

3. Particular forms of the algorithm 

Typically, the atom update problem is posed as follows. We have the dictio¬ 
nary, denoted generically D, and the associated representations matrix X and 
we want to optimize atom dj. In the DL context, at iteration k of the learning 
process, the dictionary is made of atoms from and as explained by 

the phrase around equation Q. We denote Ij the (column) indices of the sig¬ 
nals that use dj in their representation, i.e. the indices of the nonzero elements 
on the j-th row of X. Excluding atom dj, the representation error matrix (Q, 
reduced to the relevant columns, becomes 

F = Ei^ +djXj^x^. (3) 
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The updated atom dj is the solution of the optimization problem 


min \\F - dj XjXiWF 

djGRP J J 


( 4 ) 


The norm constraint ||dj ||2 = 1 is usually imposed after solving the optimization 
problem. 

AK-SVD. The K-SVD algorithm and its approximate version AK-SVD 
treat Q by considering that Xj;Zj is also a variable. Problem Q becomes a 
rank-1 approximation problem that is solved by AK-SVD with a single iteration 
of the power method (to avoid ambiguity, we add superscripts representing the 
iteration number): 

df+^^ =F{xf}f /\\F{xf}fh 

Sk+l) _ T^k+l) ^ ^ 

Note that the representations are also changed in the atom update stage, which 
is the specific of this approach. 

SGK. Dictionary learning for sparse representations as a generalization of 
K-Means clustering (SGK) [5] solves directly problem Q. This is a least squares 
problem whose solution is 


,(fc+i) 


= FxlxJ{,Xj,x,xlx^ 


( 6 ) 


The atom updates part of the general JAU scheme from algorithm [l] has the 
form described by algorithm]^ named P-SGK (with P from Parallel). The error 
E is recomputed in step 2 before each group of atom updates, thus taking into 
account the updated values of the previous groups. Depending on the value of 
h, the error can be computed more efficiently via updates to its previous value 
instead of a full recomputation. Steps 4 and 5 implement relations ^ and ([^, 
respectively. Step 6, the normalization, is identical with that from the general 
scheme. 

To obtain the JAU version of AK-SVD (named PAK-SVD), we replace step 
4 by the operations from ([^. Note that, for full parallelism (n = n), P-SGK and 
PAK-SVD are identical, since the atoms produced by ^ and ^ have the same 
direction. For full parallelism the representations updated by PAK-SVD are not 
used, while if h < n, some updated representations affect the error matrix from 
step 2. 

NSGK. The update problem Q is treated in [7] in terms of differences with 
respect to the current dictionary and representations, instead of working directly 
with D and X. Applying this idea to SGK, the optimization problem is similar, 
but with the signal matrix Y replaced by 


z = y-k (7) 


where is the sparse representation matrix at the beginning of the Re¬ 

iteration of the DL algorithm, while X^^'> is the matrix computed in the fc-th 
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Algorithm 2: P-SGK Atom Updates 


Data: current dictionary D G 

signals set Y G 
sparse representations X G 
number of parallel atoms h 
Result: next dictionary D 


1 for 

2 

3 

4 

5 

6 


i = l to n/fi do 
E=Y-DX 

for j = {i — l)n + 1 to £n, in parallel do 
F = Exj + djXj^Xj 
dj = Px^,iJ{xj^x,x^^x,) 

dj ^ dj/\\dj\\2 


iteration (e.g. in step 1 of Algorithm]^. The P-NSGK algorithm (NSGK stands 
for New SGK, the name used in |7]) is thus identical with P-SGK, with step 2 
modified according to (Q. Also, in (|^, the representations Xj^Xj are taken from 
not from X^^'> as for the other methods. 

4. Numerical results 

We give here numerical evidence supporting the advantages of the JAU 
scheme, for two standard problems: recovery of a given dictionary and DL 
for sparse image representation. We compare the JAU algorithms PAK-SVD, 
P-SGK and P-NSGK with their sequential counterparts. We report results ob¬ 
tained with the same input data for all the algorithms; in particular, the initial 
dictionary is the same. The sparse representations were computed via OMFQ 

4-.1. Dictionary recovery 

Following the numerical experiments from [B] and [7], we generated a random 
dictionary with n = 50 atoms of size p = 20 each, and a signal set Y of m = 1500 
data vectors, each vector being generated as a linear combination of s G {3,4, 5} 
different atoms. We then added white gaussian noise with SNR levels of 10, 20, 
30 and oo dB to the signal set. We applied 9s^ dictionary learning iterations on 
this signal set for each algorithm and compared the resulting dictionaries with 
the original in the same way as in [T]. The dictionary was initialized with a 
random selection of data vectors. The algorithms were given the fixed sparsity 
target s that was used to generate the original signal set. JAU methods used 
full parallelism (n = n). The percentages of recovered atoms, averaged over 50 


^ We used OMP-Box version 10 available at http://www.cs.technion.ac.il/~ronrubin/ 
software.html 
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Table 1: Percentage of recovered atoms 


s 

Method 

10 

SNR 

20 30 

OO 


NSGK 

87.16 

90.16 

89.32 

89.56 


P-NSGK 

88.36 

89.64 

89.92 

89.76 

3 

SGK 

87.44 

89.40 

88.80 

90.12 


P-SGK 

86.48 

89.84 

89.00 

88.24 


AK-SVD 

86.20 

90.00 

90.00 

88.52 


NSGK 

70.68 

91.88 

92.16 

93.16 


P-NSGK 

68.08 

92.48 

92.88 

93.28 

4 

SGK 

67.28 

92.16 

91.68 

91.92 


P-SGK 

68.28 

93.48 

91.88 

92.56 


AK-SVD 

70.08 

92.76 

92.16 

92.32 


NSGK 

10.24 

92.36 

92.72 

94.40 


P-NSGK 

10.60 

93.08 

93.32 

94.72 

5 

SGK 

11.68 

93.28 

92.60 

93.92 


P-SGK 

12.04 

93.00 

92.92 

93.92 


AK-SVD 

11.64 

92.72 

92.88 

94.96 


runs, are presented in table (PAK-SVD is not reported, since it gives the 
same results as P-SGK.) 

We note that, although the JAU algorithms give the best result in 6 out 
of the 12 considered problems, the results are rather similar for all algorithms. 
(We may infer that, in this problem, the sparse representation stage is the 
bottleneck, not the atom update stage.) We can at least conclude that, for 
dictionary recovery, the JAU scheme is not worse than the sequential ones. 

4-.2. Dictionary learning 

The training signals were images taken from the USC-SIPI m database 
(e.g. barb, lena, boat, etc.). The images were normalized and split into random 
8x8 blocks. The initial dictionary was built with random atoms. 

In a first experiment, we used m = 32768 signals of dimension p = 64 to 
train dictionaries with n = 512 atoms, with a target sparsity of s = 8. In 
figure we can see the evolution of the representation RMSE, averaged over 
10 runs, for the JAU and sequential algorithms. JAU algorithms have full 
parallelism [n = n). (Note that the PAK-SVD and P-SGK curves are slightly 
different, due to the computation of the errors at the end of a DL iteration, 
where PAK-SVD has different representations; otherwise, the dictionaries are 
identical.) Although the sequential algorithms have smoother convergence, the 
proposed parallel versions obtain clearly better results. Among the sequential 
algorithms, NSGK is the best, confirming the findings from [7]. However, all 
parallel algorithms are better than NSGK. 

The same conclusion is supported by a second experiment, where the condi¬ 
tions are similar but, for faster execution, only m = 16384 training signals were 
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Figure 1: Error evolution for parallel and sequential algorithms. 
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Table 2: Best RMSE values after 200 iterations 



n = 128 

n = 256 

n = 512 

NSGK 

0.0185 

0.0168 

0.0154 

P-NSGK 

0.0167 

0.0154 

0.0139 

SGK 

0.0201 

0.0185 

0.0166 

P-SGK 

0.0165 

0.0153 

0.0138 

AK-SVD 

0.0201 

0.0184 

0.0163 


used. Tablej^shows the lowest RMSE after k = 200 iterations, averaged over 10 
runs, for three values of the dictionary size n. In all cases, the JAU algorithms 
are clearly the best. 

To show the influence of the group size n of the JAU algorithms, we present 
the evolution of the error, averaged over 10 runs, in figure We see that the 
effect of group size is less intuitive. Full parallelism (n = n) is the winner, 
although some smaller values of n are good competitors, almost all being better 
than the sequential version {n = 1). Although in this example n = 256 is worse 
than some smaller values, the error usually decreases as n grows, the best value 
being n = n in all our tests, for all parallel methods. A possible explanation 
is that the JAU strategy, due to the independent atom updates, is less prone 
to get trapped in local minima. Modifying atoms one by one, although locally 
optimal, may imply only small modifications of the atoms; in contrast, JAU 
appears to be able of larger updates that make convergence more erratic, but 
can reach a better dictionary. 

4-3. JAU versus MOD 

We now compare the performance in representation error of the JAU al¬ 
gorithms with the intrinsically parallel algorithm named method of optimal 
directions (MOD) jj]. MOD uses OMP for representation and updates the dic¬ 
tionary D with the least-squares solution of the linear system DX = Y. For 
completeness we also include the sequential versions on which JAU algorithms 
are built. 

In figures [3]-|^ we depict the JAU algorithms with green, the sequential ver¬ 
sions with red and MOD with black. All algorithms performed DL for k = 200 
iterations. Each data point from these figures represents an average of 10 runs 
of the same algorithm with the same parametrization and data dimensions but 
with training sets composed of different image patches. 

To see how sparsity influences the end result, figure [^presents the final errors 
for sparsity levels starting from s = 4uptos = 12 when performing DL for 
dictionaries of n = 128 atoms on training sets of size m = 8192. We notice that 
for all three algorithms (NSGK, SGK and AK-SVD) the JAU methods perform 
similar to MOD at lower sparsity constraints, but as we pass s = 8 our proposed 
parallel strategy is clearly better. The sequential versions always come in last, 
except perhaps for NSGK that comes close to MOD past s = 10. 












Figure 2: P-NSGK error evolution for various group sizes. 



Figure 3: Final errors for different sparsity constraints. The sequential versions 
are red, JAU algorithms are green and MOD is black. 
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Figure 4: Final errors for varied dictionary sizes. The sequential versions are 
red, JAU algorithms are green and MOD is black. 


Figure presents the final errors for DL on training sets of m = 12288 
signals, with a sparsity constraint of s = 12, when varying the total number of 
atoms in the dictionary from n = 128 to n = 512 in increments of 64. Again, 
the JAU versions are the winners for all three algorithms. Out of the sequential 
algorithms, NSGK is the only one that manages to out-perform MOD, while 
the others lag behind coming in last. 

The next experiment investigates the influence of the signal set size on the 
final errors. In figure we kept a fixed dictionary size of n = 256 and a sparsity 
of s = 10 and performed DL starting with training sets of m = 4096 signals that 
we increased in increments of 1024 up to m = 16384. JAU stays ahead of MOD 
almost everywhere, except for small signal sets with m < 5000 where the results 
are similar. The sequential versions are once again the poorest performers. 

Finally, we present in flgure|^the error improvement at each iteration for all 
algorithms, for several sparsity levels. In this experiment we used a dictionary 
of n = 128 atoms and a training set of m = 8192 signals. We can see that the 
JAU versions can jump back and forwards, specially during the first iterations. 
We think that this is due to the parallel update of the dictionary atoms which 
leads to jumps from one local minima to another until a stable point is reached. 
This is, perhaps, the reason why in the end it manages to provide a lower 
representation error. Even though the JAU convergence is not as smooth as 
MOD or the sequential versions, it has a consistent descendent trend. 
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Figure 5: Final errors for varied training set sizes. The sequential versions are 
red, JAU algorithms green and MOD is black. 
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Figure 6: Error evolution at different sparsity constraints. The sequential ver¬ 
sions are red, JAU algorithms green and MOD is black. 
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Figure 7: Execution times. The sequential versions are red and the JAU algo¬ 
rithms are green. 


4-.4- Execution times 

We now focus on the improvements in execution time. We used OpenCL for 
our GPU implementation of the JAU algorithms and performed the execution on 
an ATI FirePro V8800 (FireGL V) card from AMD, running at a maximum clock 
frequency of 825MHz, having 1600 streaming processors, 2GB global memory 
and 32KB local memory. For the sequential versions we used a G implementation 
that kept identical instructions everywhere where possible in order to provide 
an accurate comparison that clearly shows the improvements in execution time 
brought, almost exclusively, by the JAU strategy. The sequential tests were 
executed on an Intel i7-3930K GPU running at a maximum clock frequency of 
3.2GHz. 

We present 3 experiments in figure where we vary the sparsity constraint, 
the atoms in the dictionary and the number of signals in the training set. We 
depict the JAU versions with green and the sequential versions with red. Be¬ 
cause of the significant difference in execution time, we use a logarithmic scale. 
Again, we used k = 200 iterations for all methods. 

For the sparsity experiment we used a dictionary of n = 128 and a training 
set of m = 8192 and we increased the sparsity from s = 4 to s = 12 in increments 
of 2. When studying the dictionary impact on the execution performance we 
kept a fixed training set of m = 12288 and a sparsity of s = 6 and varied the 
atoms from n = 128 in increments of 64 up to n = 512. Finally, we increased 
the signal set in increments of 1024 starting from m = 4096 until m = 16384 
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with a fixed dictionary of n = 256 and a sparsity of s = 10. In the panel the 
abscissa tics represent thousands of signals. 

In all of our experiments the JAU versions showed important improvements 
in execution time, the speed-up reaching values as high as 10.6 times for NSGK, 
10.8 times for SGK and 12 times for AK-SVD. This was to be expected, since 
JAU algorithms are naturally parallel in the atom update stage. 

5. Conclusions 

We have shown that several dictionary learning algorithms, like AK-SVD [2], 
SGK j6] and NSGK [7], benefit from adopting Jacobi (parallel) atom updates 
instead of the usual Gauss-Seidel (sequential) ones. We have also shown that 
the new Jacobi algorithms outperform their sequential standard versions and 
also other types of algorithms like MOD [3]. In the mostly academic dictionary 
recovery problem, the parallel and sequential versions have similar performance. 
However, in the more practical problem of dictionary learning for sparse image 
representation, the proposed parallel algorithms have a clearly better behavior 
with superior execution times. 
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